Smart Paper Notes

IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for Indian Languages

Ananya B. Sai , Vignesh Nagarajan , Tanay Dixit , Raj Dabre , Anoop Kunchukuttan , Pratyush Kumar , Mitesh M. Khapra

Category: Natural Language Processing

2022-12-20

The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate evaluation metrics being used, which enables a better selection of metrics that best reflect MT quality. Unfortunately, most of the research focuses on high-resource languages, mainly English, the observations for which may not always apply to other languages. Indian languages, having over a billion speakers, are linguistically different from English, and to date, there has not been a systematic study of evaluating MT systems from English into Indian languages. In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics. Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores. Additionally, we find that the metrics do not adequately capture fluency-based errors in Indian languages, and there is a need to develop metrics focused on Indian languages. We hope that our dataset and analysis will help promote further research in this area.
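
A minimal sketch of the meta-evaluation step described above: measuring how well an automatic metric tracks annotator scores. The score lists are hypothetical values made up for illustration, and the choice of Pearson and Kendall tau-a is an assumption; the abstract does not say which correlation statistics the authors report.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical segment-level scores, for illustration only.
human_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
metric_scores = [0.85, 0.50, 0.65, 0.30, 0.70]

print(pearson(human_scores, metric_scores))
print(kendall_tau_a(human_scores, metric_scores))
```

A metric that ranks every pair of segments the same way as the annotators gets a tau of 1; each pair ranked in the opposite order pulls the value toward -1.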

Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

Arnav Mhaske , Harshit Kedia , Sumanth Doddapaneni , Mitesh M. Khapra , Pratyush Kumar , Rudra Murthy V , Anoop Kunchukuttan

Category: Natural Language Processing

2022-12-20

We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. For 9 of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization). The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated test sets for 8 languages containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing test sets and the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. IndicNER achieves the best F1 on the Naamapadam-test set compared to an mBERT model fine-tuned on existing datasets, and an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licenses at https://ai4bharat.iitm.ac.in/naamapadam.
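
The projection step can be illustrated with a small sketch. It assumes that word alignments between the English and the Indian-language sentence are already available as index pairs; the function name `project_tags`, the tag strings, and the example sentence are hypothetical, and the authors' actual pipeline may differ.

```python
def project_tags(src_tags, alignment, tgt_len):
    """Copy each non-"O" source tag onto the target tokens it is aligned to.

    src_tags  : one tag per English token, e.g. "PER", "LOC", "ORG" or "O"
    alignment : (source_index, target_index) pairs from a word aligner
    tgt_len   : number of tokens in the target sentence
    """
    tgt_tags = ["O"] * tgt_len
    for src_idx, tgt_idx in alignment:
        if src_tags[src_idx] != "O":
            tgt_tags[tgt_idx] = src_tags[src_idx]
    return tgt_tags


# Hypothetical example: "Sachin lives in Mumbai" -> "सचिन मुंबई में रहता है"
english_tags = ["PER", "O", "O", "LOC"]
alignment = [(0, 0), (1, 3), (2, 2), (3, 1)]
print(project_tags(english_tags, alignment, tgt_len=5))
```

Target tokens with no aligned entity keep the "O" tag, which is why alignment quality directly limits the quality of the projected annotations.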

IndicXTREME: A Multi-Task Benchmark For Evaluating Indic Languages

Sumanth Doddapaneni , Rahul Aralikatte , Gowtham Ramesh , Shreya Goyal , Mitesh M. Khapra , Anoop Kunchukuttan , Pratyush Kumar

Category: Natural Language Processing

2022-12-11

In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate our datasets (for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done). To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.

Towards Building Text-To-Speech Systems for the Next Billion Users

Gokul Karthik Kumar , Praveen S V , Pratyush Kumar , Mitesh M. Khapra , Karthik Nandakumar

Category: Natural Language Processing | Machine Learning

2022-11-17

Deep learning based text-to-speech (TTS) systems have been evolving rapidly with advances in model architectures, training methodologies, and generalization across speakers and languages. However, these advances have not been thoroughly investigated for Indian language speech synthesis. Such investigation is computationally expensive given the number and diversity of Indian languages, relatively lower resource availability, and the diverse set of advances in neural TTS that remain untested. In this paper, we evaluate the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages. Based on this, we identify monolingual models with FastPitch and HiFi-GAN V1, trained jointly on male and female speakers to perform the best. With this setup, we train and evaluate TTS models for 13 languages and find our models to significantly improve upon existing models in all languages as measured by mean opinion scores. We open-source all models on the Bhashini platform.

Deep Convolutional Architectures for Extrapolative Forecast in Time-dependent Flow Problems

Pratyush Bhatt , Yash Kumar , Azzeddine Soulaimani

Category: Machine Learning

2022-09-18

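Physical systems whose dynamics are governed by partial differential equations (PDEs) find applications in numerous fields, from engineering design to weather forecasting. Obtaining solutions from such PDEs can be computationally expensive for large-scale and parameterized problems. In this work, deep learning techniques developed for time-series forecasting, such as LSTM and TCN, or for spatial feature extraction, such as CNN, are employed to model the system dynamics for advection-dominated problems. These models take as input a sequence of high-fidelity vector solutions for consecutive time steps obtained from the PDEs and forecast the solutions for the subsequent time steps using auto-regression, thereby reducing the computation time and power needed to obtain such high-fidelity solutions. The models are tested on numerical benchmarks (the 1D Burgers' equation and Stoker's dam-break problem) to assess long-term prediction accuracy, even outside the training domain (extrapolation). Non-intrusive reduced-order modelling techniques, such as deep auto-encoder networks, are used to compress the high-fidelity snapshots before feeding them to the forecasting models, in order to reduce the complexity and the required computations in the online and offline stages. Deep ensembles are employed to perform uncertainty quantification of the forecasting models, which provides information about the variance of the predictions that results from epistemic uncertainty.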
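
The auto-regressive forecasting loop can be sketched as follows. The `step_model` argument stands in for a trained LSTM, TCN, or CNN; here it is replaced by a toy linear extrapolation so the example runs on its own. All names and values are hypothetical.

```python
def rollout(step_model, history, n_steps, window):
    """Forecast n_steps states by feeding predictions back in as inputs.

    step_model : callable mapping a list of `window` states to the next state
    history    : list of known states (at least `window` of them)
    """
    states = list(history)
    for _ in range(n_steps):
        next_state = step_model(states[-window:])
        states.append(next_state)
    return states[len(history):]


def linear_extrapolation(recent):
    """Toy stand-in for a trained network: continue the last observed trend."""
    prev, last = recent[-2], recent[-1]
    return [2 * b - a for a, b in zip(prev, last)]


# Hypothetical two-component states at three consecutive time steps.
history = [[0.0, 1.0], [0.5, 1.5], [1.0, 2.0]]
print(rollout(linear_extrapolation, history, n_steps=3, window=2))
```

Because each prediction becomes part of the next input window, errors can accumulate over the rollout, which is one reason long-horizon and extrapolative accuracy are evaluated separately.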

Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages

Kaushal Santosh Bhogale , Abhigyan Raman , Tahir Javed , Sumanth Doddapaneni , Anoop Kunchukuttan , Pratyush Kumar , Mitesh M. Khapra

Category: Natural Language Processing

2022-08-26

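End-to-end (E2E) models have become the default choice for state-of-the-art speech recognition systems. Such models are trained on large amounts of labelled data, which are often not available for low-resource languages. Techniques such as self-supervised learning and transfer learning hold promise, but have not yet been effective in training accurate models. On the other hand, collecting labelled datasets across a diverse set of domains and speakers is very expensive. In this work, we demonstrate an inexpensive and effective alternative to these approaches by "mining" text and audio pairs for Indian languages from public sources, specifically from the public archives of All India Radio. As a key component, we adapt the Needleman-Wunsch algorithm to align sentences with corresponding audio segments, given a long audio recording and a PDF of its transcript, while being robust to errors due to OCR, extraneous text, and non-transcribed speech. We thus create Shrutilipi, a dataset which contains over 6,400 hours of labelled audio across 12 Indian languages, totalling 4.95M sentences. On average, Shrutilipi results in a 2.3x increase over publicly available labelled data. We establish the quality of Shrutilipi with 21 human evaluators across the 12 languages. We also establish the diversity of Shrutilipi in terms of represented regions, speakers, and mentioned named entities. Significantly, we show that adding Shrutilipi to the training set of Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages on the IndicSUPERB benchmark. For Hindi, which has the most benchmarks (7), the average WER falls from 18.8% to 13.5%. This improvement extends to efficient models: we show a 2.3% drop in WER for a Conformer model (10x smaller than Wav2Vec). Finally, we demonstrate the diversity of Shrutilipi by showing that a model trained on it is more robust to noisy input.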
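
For reference, here is a minimal sketch of the textbook Needleman-Wunsch global alignment on word tokens. It is not the authors' adapted version: the scoring values (`match`, `mismatch`, `gap`) and the example word lists are assumptions chosen for illustration, and the robustness to OCR errors and untranscribed speech described above is not modelled.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return a list of (a_item, b_item) pairs.

    A gap is represented by None on the side that has no counterpart.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical transcript words vs. words recognised from the audio.
transcript = ["the", "rain", "in", "spain"]
recognised = ["the", "rain", "spain"]
for pair in needleman_wunsch(transcript, recognised):
    print(pair)
```

Words paired with `None` mark text that has no counterpart in the other sequence, which is the signal an alignment-based mining pipeline can use to discard extraneous text or untranscribed speech.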

Efficient ML Models for Practical Secure Inference

Vinod Ganesan , Anwesh Bhattacharya , Pratyush Kumar , Divya Gupta , Rahul Sharma , Nishanth Chandran

Category: Machine Learning

2022-08-26

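ML-as-a-service continues to grow, and so does the need for very strong privacy guarantees. Secure inference has emerged as a potential solution, wherein cryptographic primitives allow inference without revealing the user's inputs to the model provider or the model's weights to the user. For instance, the model provider could be a diagnostics company that has trained a state-of-the-art DenseNet-121 model for interpreting chest X-rays, and the user could be a patient at a hospital. While secure inference is in principle feasible for this setting, no existing technique makes it practical at scale. The CrypTFlow2 framework provides a potential solution with its ability to automatically and correctly translate clear-text inference to secure inference for arbitrary models. However, the resulting secure inference from CrypTFlow2 is impractically expensive: almost 3TB of communication is required to interpret a single X-ray on DenseNet-121. In this paper, we address this outstanding challenge of inefficient secure inference with three contributions. First, we show that the primary bottlenecks in secure inference are large linear layers, which can be optimized through the choice of network backbone and the use of operators developed for efficient clear-text inference. This finding and emphasis deviate from many recent works, which focus on optimizing non-linear activation layers when performing secure inference of smaller networks. Second, based on an analysis of a bottlenecked convolution layer, we design an X-operator that is a more efficient drop-in replacement. Third, we show that the fast Winograd convolution algorithm further improves the efficiency of secure inference. In combination, these three optimizations prove highly effective for the problem of X-ray interpretation trained on the CheXpert dataset.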
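
To illustrate why Winograd convolution reduces work in linear layers, the sketch below shows the standard one-dimensional F(2,3) case in clear text: two outputs of a 3-tap filter computed with 4 multiplications instead of 6. This is a plain arithmetic illustration only; it involves no cryptography and is not taken from the paper's implementation.

```python
def direct_conv(d, g):
    """Two outputs of a 3-tap filter g slid over four inputs d (6 multiplications)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3), using 4 elementwise multiplications."""
    # Filter transform: can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by the four multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


# Hypothetical input tile and filter.
tile = [1.0, 2.0, 3.0, 4.0]
taps = [2.0, 4.0, 6.0]
print(direct_conv(tile, taps))
print(winograd_f23(tile, taps))
```

The filter transform depends only on the weights, so it can be computed once and reused for every input tile; the saving comes from the reduced number of input-dependent multiplications.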

IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages

Tahir Javed , Kaushal Santosh Bhogale , Abhigyan Raman , Anoop Kunchukuttan , Pratyush Kumar , Mitesh M. Khapra

Category: Natural Language Processing

2022-08-24

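A cornerstone of AI research has been the creation and adoption of standardized training and test datasets to mark the progress of state-of-the-art models. A particularly successful example is the GLUE dataset for training and evaluating Natural Language Understanding (NLU) models for English. The large body of research around BERT-based language models has revolved around performance improvements on the NLU tasks in GLUE. To evaluate language models in other languages, several language-specific GLUE datasets were created. The area of speech language understanding (SLU) has followed a similar trajectory. The success of large self-supervised models such as wav2vec2 enables the creation of speech models from relatively easy-to-access unlabelled data. These models can then be evaluated on SLU tasks, such as those in the SUPERB benchmark. In this work, we extend this to Indic languages by releasing the IndicSUPERB benchmark. Specifically, we make the following three contributions. (i) We collect Kathbath, containing 1,684 hours of labelled speech data across 12 Indian languages from 1,218 contributors located in 203 districts in India. (ii) Using Kathbath, we create benchmarks across 6 speech tasks: Automatic Speech Recognition, Speaker Verification, Speaker Identification (mono/multi), Language Identification, Query By Example, and Keyword Spotting for 12 languages. (iii) On the released benchmarks, we train and evaluate different self-supervised models alongside a commonly used baseline, FBANK. We show that language-specific fine-tuned models are more accurate than the baseline on most of the tasks, including a large gap of 76% for the Language Identification task. However, for speaker identification, self-supervised models trained on large datasets demonstrate an advantage. We hope IndicSUPERB contributes to progress in developing speech language understanding models for Indian languages.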

A CNN based method for Sub-pixel Urban Land Cover Classification using Landsat-5 TM and Resourcesat-1 LISS-IV Imagery

Krishna Kumar Perikamana , Krishnachandran Balakrishnan , Pratyush Tripathy

Category: Computer Vision | Machine Learning

2021-12-16

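Time-series data of urban land cover is of great utility in analyzing urban growth patterns, changes in the distribution of impervious surface and vegetation, and the resulting impacts on urban microclimate. While Landsat data is ideal for such analysis because of its long time series of free imagery, traditional per-pixel hard classification fails to realize the full potential of the Landsat data. This paper proposes a sub-pixel classification method that leverages the temporal overlap of the Landsat-5 TM and Resourcesat-1 LISS-IV sensors. We train a convolutional neural network to predict fractional land cover maps from 30m Landsat-5 TM data. The reference land cover fractions are estimated from a hard-classified 5.8m LISS-IV image of Bengaluru from 2011. Further, we demonstrate the generalizability and superior performance of our CNN model using data for Mumbai from 2009 and comparing it to the results obtained using a Random Forest classifier. For both the Bengaluru (2011) and Mumbai (2009) data, the Mean Absolute Percentage Error of our CNN model is in the range of 7.2 to 11.3 for both built-up and vegetation fraction prediction at the 30m cell level. Unlike the majority of other recent studies, where validation is conducted using data for a limited spatial extent, our model has been trained and validated using data for the complete spatial extent of two megacities for two different time periods. Hence it can reliably generate 30m built-up and vegetation fraction maps from Landsat-5 TM time-series data to analyze long-term urban growth patterns.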
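
The idea of deriving reference fractions from a finer hard classification can be sketched as a block aggregation. The example assumes an integer number of fine pixels per coarse cell, which is a simplification: 5.8m pixels do not tile a 30m cell exactly, so a real workflow would need resampling or area weighting. The function name, class labels, and grid are hypothetical.

```python
def class_fractions(hard_map, block, target_class):
    """Fraction of `target_class` pixels inside each block x block window.

    hard_map : 2D list of class labels from a fine-resolution hard classification
    block    : number of fine pixels per coarse cell along each axis (assumed integer)
    """
    rows, cols = len(hard_map) // block, len(hard_map[0]) // block
    fractions = []
    for r in range(rows):
        row = []
        for c in range(cols):
            hits = sum(
                hard_map[r * block + i][c * block + j] == target_class
                for i in range(block)
                for j in range(block)
            )
            row.append(hits / (block * block))
        fractions.append(row)
    return fractions


# Hypothetical 4x4 fine-resolution map: "B" = built-up, "V" = vegetation.
fine_map = [
    ["B", "B", "V", "V"],
    ["B", "V", "V", "V"],
    ["V", "V", "B", "B"],
    ["V", "V", "B", "B"],
]
print(class_fractions(fine_map, block=2, target_class="B"))
```

Each coarse cell then carries a value between 0 and 1 per class, which is the kind of target a regression-style CNN can be trained to predict from the coarser imagery.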

Towards Building ASR Systems for the Next Billion Users

Tahir Javed , Sumanth Doddapaneni , Abhigyan Raman , Kaushal Santosh Bhogale , Gowtham Ramesh , Anoop Kunchukuttan , Pratyush Kumar , Mitesh M. Khapra

Category: Natural Language Processing

2021-11-06

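Recent methods in speech and language technology pretrain very large models which are then fine-tuned for specific tasks. However, the benefits of such large models are often limited to a few resource-rich languages of the world. In this work, we make multiple contributions towards building ASR systems for low-resource languages from the Indian subcontinent. First, we curate 17,000 hours of raw speech data for 40 Indian languages from a wide variety of domains, including education, news, technology, and finance. Second, using this raw speech data, we pretrain several variants of wav2vec-style models for 40 Indian languages. Third, we analyze the pretrained models to find key features: codebook vectors of similar-sounding phonemes are shared across languages, representations across layers are discriminative of the language family, and attention heads often attend within small local windows. Fourth, we fine-tune this model for downstream ASR in 9 languages and obtain state-of-the-art results on 3 public datasets, including for very low-resource languages such as Sinhala and Nepali. Our work establishes that multilingual pretraining is an effective strategy for building ASR systems for the linguistically diverse speakers of the Indian subcontinent.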
